SITE LINK

KMID : 1022420190110020057

Phonetics and Speech Sciences
2019 Volume.11 No. 2 p.57 ~ p.64

Performance comparison of various deep neural network architectures using Merlin toolkit for a Korean TTS system

Hong Jun-Young

Kwon Chul-Hhong

Abstract

In this paper, we construct a Korean text-to-speech system using the Merlin toolkit which is an open source system for speech synthesis. In the text-to-speech system, the HMM-based statistical parametric speech synthesis method is widely used, but it is known that the quality of synthesized speech is degraded due to limitations of the acoustic modeling scheme that includes context factors. In this paper, we propose an acoustic modeling architecture that uses deep neural network technique, which shows excellent performance in various fields. Fully connected deep feedforward neural network (DNN), recurrent neural network (RNN), gated recurrent unit (GRU), long short-term memory (LSTM), bidirectional LSTM (BLSTM) are included in the architecture. Experimental results have shown that the performance is improved by including sequence modeling in the architecture, and the architecture with LSTM or BLSTM shows the best performance. It has been also found that inclusion of delta and delta-delta components in the acoustic feature parameters is advantageous for performance improvement.

KEYWORD

deep neural networks, Merlin toolkit, text-to-speech (TTS)

FullTexts / Linksout information

Listed journal information

site infomation

Prohibition of Unauthorized Collection of E-mail Addresses, medric.kyung@gmail.com
N4 301, Chungbuk National University, Chungdae-ro 1, Seowon-Gu, Cheongju, Chungbuk 28644, Korea